FilterBoost: Regression and Classification on Large Datasets

نویسندگان

Joseph K. Bradley

Robert E. Schapire

چکیده

We study boosting in the filtering setting, where the booster draws examples from an oracle instead of using a fixed training set and so may train efficiently on very large datasets. Our algorithm FilterBoost, which is based on a logistic regression technique proposed by Collins et al. (2002), requires fewer assumptions to achieve bounds equivalent to or better than previous work. Our proofs demonstrate the algorithm’s strong theoretical properties for both classification and conditional probability estimation, and we validate these results through extensive experiments. Empirically, our algorithm proves more robust to noise and overfitting than batch boosters in conditional probability estimation and proves competitive in classification. We give several extensions of FilterBoost to the multiclass case, proving PAC bounds on each. In particular, we make use of the ideas of pseudoloss and Error-Correcting Output Codes used by Freund and Schapire (1997) and Schapire (1997) to create easily implementable boosters, and we show that the generalization of Output Codes by Allwein et al. (2000) extends to FilterBoost. This work represents one of the first studies of boosting-byfiltering for multiclass problems. FilterBoost: Regression and Classification on Large Datasets

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A Pre-Trained Ensemble Model for Breast Cancer Grade Detection Based on Small Datasets

Background and Purpose: Nowadays, breast cancer is reported as one of the most common cancers amongst women. Early detection of the cancer type is essential to aid in informing subsequent treatments. The newest proposed breast cancer detectors are based on deep learning. Most of these works focus on large-datasets and are not developed for small datasets. Although the large datasets might lead ...

متن کامل

SFLA Based Gene Selection Approach for Improving Cancer Classification Accuracy

In this paper, we propose a new gene selection algorithm based on Shuffled Frog Leaping Algorithm that is called SFLA-FS. The proposed algorithm is used for improving cancer classification accuracy. Most of the biological datasets such as cancer datasets have a large number of genes and few samples. However, most of these genes are not usable in some tasks for example in cancer classification....

متن کامل

Mammalian Eye Gene Expression Using Support Vector Regression to Evaluate a Strategy for Detecting Human Eye Disease

Background and purpose: Machine learning is a class of modern and strong tools that can solve many important problems that nowadays humans may be faced with. Support vector regression (SVR) is a way to build a regression model which is an incredible member of the machine learning family. SVR has been proven to be an effective tool in real-value function estimation. As a supervised-learning appr...

متن کامل

Fast SFFS-Based Algorithm for Feature Selection in Biomedical Datasets

Biomedical datasets usually include a large number of features relative to the number of samples. However, some data dimensions may be less relevant or even irrelevant to the output class. Selection of an optimal subset of features is critical, not only to reduce the processing cost but also to improve the classification results. To this end, this paper presents a hybrid method of filter and wr...

متن کامل

Chemometrics-enhanced Classification of Source Rock Samples Using their Bulk Geochemical Data: Southern Persian Gulf Basin

Chemometric methods can enhance geochemical interpretations, especially when working with large datasets. With this aim, exploratory hierarchical cluster analysis (HCA) and principal component analysis (PCA) methods are used herein to study the bulk pyrolysis parameters of 534 samples from the Persian Gulf basin. These methods are powerful techniques for identifying the patterns of variations i...

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره شماره

صفحات -

تاریخ انتشار 2007

FilterBoost: Regression and Classification on Large Datasets

نویسندگان

چکیده

منابع مشابه

A Pre-Trained Ensemble Model for Breast Cancer Grade Detection Based on Small Datasets

SFLA Based Gene Selection Approach for Improving Cancer Classification Accuracy

Mammalian Eye Gene Expression Using Support Vector Regression to Evaluate a Strategy for Detecting Human Eye Disease

Fast SFFS-Based Algorithm for Feature Selection in Biomedical Datasets

Chemometrics-enhanced Classification of Source Rock Samples Using their Bulk Geochemical Data: Southern Persian Gulf Basin

عنوان ژورنال:

اشتراک گذاری